Abstract



In this paper we study the online learning problem involving rested and restless multiarmed bandits 
with multiple plays. The system consists of a single player/user and a set of K finite-state discrete-time
Markov chains (arms) with unknown state spaces and statistics. At each time step the player can play 
M, M < K, arms. The objective of the user is to decide for each step which M of the K arms to 
play over a sequence of trials so as to maximize its long term reward. The restless multiarmed bandit is 
particularly relevant to the application of opportunistic spectrum access (OSA), where a (secondary) user 
has access to a set of K channels, each of time-varying condition as a result of random fading and/or
certain primary users' activities. We first show that a logarithmic regret algorithm exists for the rested
multiarmed bandit problem. We then construct an algorithm for the restless bandit problem which utilizes
regenerative cycles of a Markov chain and computes a sample mean based index policy. We show that
under mild conditions on the state
transition probabilities of the Markov chains this algorithm achieves logarithmic regret uniformly over 
time, and that this regret bound is also optimal. 


I. Introduction 

In this paper we study the online learning problem involving rested and restless multiarmed bandits 
with multiple plays. The system consists of a single player/user and a set of K finite-state discrete-time 
Markov chains (also referred to as arms) with unknown state spaces and statistics. At each time step the 
player can play M, M < K, arms. Each arm played generates a reward depending on the state the arm 
is in when played. The state of an arm is only observed when it is played, and otherwise unknown to
the user. The objective of the user is to decide for each step which M of the K arms to play over a
sequence of trials so as to maximize its long term reward. To do so it must use all its past actions and
observations to essentially learn the quality of each arm (e.g., their expected rewards). We consider two
cases, one with rested arms where the state of a Markov chain stays frozen unless it's played, the other
with restless arms where the state of a Markov chain may continue to evolve (according to a possibly
different law) regardless of the player's actions.

The above problem is motivated by the following opportunistic spectrum access (OSA) problem. A 
(secondary) user has access to a set of K channels, each of time-varying condition as a result of random 
fading and/or certain primary users' activities. The condition of a channel is assumed to evolve as a 
Markov chain. At each time step, the secondary user (simply referred to as the user for the rest of the
paper, since there is no ambiguity) senses or probes M of the K channels to find out their condition, and
is allowed to use the channels in a way consistent with their conditions. For instance, good channel 
conditions result in higher data rates or lower power for the user and so on. In some cases channel 
conditions are simply characterized as being available and unavailable, and the user is allowed to use all 
channels sensed to be available. This is modeled as a reward collected by the user, the reward being a 
function of the state of the channel or the Markov chain. 

The restless bandit model is particularly relevant to this application because the state of each Markov 
chain evolves independently of the action of the user. The restless nature of the Markov chains follows 
naturally from the fact that channel conditions are governed by external factors like random fading, 
shadowing, and primary user activity. In the remainder of this paper a channel will also be referred to 
as an arm, the user as player, and probing a channel as playing or selecting an arm. 

Within this context, the user's performance is typically measured by the notion of regret. It is defined
as the difference between the expected reward that can be gained by an "infeasible" or ideal policy,
i.e., a policy that requires either a priori knowledge of some or all statistics of the arms or hindsight
information, and the expected reward of the user's policy. The most commonly used infeasible policy
is the best single-action policy, which is optimal among all policies that always play the same arm.
An ideal policy could, for instance, play the arm that has the highest expected reward (which requires
statistical information but not hindsight). This type of regret is sometimes also referred to as the weak
regret; see, e.g., the work by Auer et al. [1]. In this paper we will only focus on this definition of regret.
Discussion on possibly stronger regret measures is given in Section VI.

This problem is a typical example of the tradeoff between exploration and exploitation. On the one 
hand, the player needs to sufficiently explore all arms so as to discover with accuracy the set of best 
arms and avoid getting stuck playing an inferior one erroneously believed to be in the set of best arms.
On the other hand, the player needs to avoid spending so much time sampling the arms and collecting
statistics that the best arms are not played often enough to get a high return.

In most prior work on the class of multiarmed bandit problems, originally proposed by Robbins [2], the
rewards are assumed to be independently drawn from a fixed (but unknown) distribution. It's worth noting
that with this iid assumption on the reward process, whether an arm is rested or restless is inconsequential
for the following reason. Since the rewards are independently drawn each time, whether an unselected
arm remains still or continues to change does not affect the reward the arm produces the next time it 
is played whenever that may be. This is clearly not the case with Markovian rewards. In the rested 
case, since the state is frozen when an arm is not played, the state in which we next observe the arm is 
independent of how much time elapses before we play the arm again. In the restless case, the state of 
an arm continues to evolve, thus the state in which we next observe it is now dependent on the amount 
of time that elapses between two plays of the same arm. This makes the problem significantly more 
difficult. 
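
The distinction can be made concrete with a small numerical sketch. The two-state transition matrix `P` below is a hypothetical example, not taken from the paper. For a rested arm, the state seen at the next play is one transition away from the last observed state no matter how long the arm waits; for a restless arm, it is `gap` transitions away, so its distribution is a row of $P^{gap}$.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def k_step(p, k):
    """Return P^k, the k-step transition matrix (k >= 0)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, p)
    return result


# Hypothetical two-state channel (assumption): state 0 = bad, state 1 = good.
P = [[0.9, 0.1],
     [0.2, 0.8]]


def next_observed_dist(last_state, gap, restless):
    """Distribution of the state seen at the next play, `gap` slots after the last play."""
    # A rested arm is frozen while it is not played, so only one transition happens.
    steps = gap if restless else 1
    return k_step(P, steps)[last_state]


for gap in (1, 5, 50):
    print(gap, next_observed_dist(1, gap, restless=False), next_observed_dist(1, gap, restless=True))
```

For long gaps the restless distribution approaches the stationary distribution of `P`, while the rested distribution stays equal to the row of `P` for the last observed state.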

Below we briefly summarize the most relevant results in the literature. Lai and Robbins in [3] model
rewards as single-parameter univariate densities, give a lower bound on the regret, and construct
policies that achieve this lower bound, which are called asymptotically efficient policies. This result is
extended by Anantharam et al. in [4] to the case of playing more than one arm at a time. Using a similar
approach, Anantharam et al. in [5] develop index policies that are asymptotically efficient for arms with
rewards driven by finite, irreducible, aperiodic and rested Markov chains with identical state spaces and
single-parameter families of stochastic transition matrices. Agrawal in [6] considers sample mean based
index policies for the iid model that achieve O(log n) regret, where n is the total number of plays. Auer
et al. in [7] also propose sample mean based index policies for iid rewards with bounded support; these
are derived from [6], but are simpler than those in [6] and are not restricted to a specific family of
distributions. These policies achieve logarithmic regret uniformly over time rather than asymptotically in
time, but have a bigger constant than that in [3]. In [8] we showed that the index policy in [7] is order
optimal for Markovian rewards drawn from rested arms but not restricted to single-parameter families,
under some assumptions on the transition probabilities. Parallel to the work presented here, in [9] an
algorithm was constructed that achieves logarithmic regret for the restless bandit problem. The mechanism
behind this algorithm however is quite different from what's presented here; this difference is discussed
in more detail in Section VI.

Other works such as [10], [11], [12] consider the iid reward case in a decentralized multiplayer setting;
players selecting the same arms experience collision according to a certain collision model. We would
like to mention another class of multiarmed bandit problems in which the statistics of the arms are
known a priori and the state is observed perfectly; these are thus optimization problems rather than
learning problems. The rested case is considered by Gittins [13] and the optimal policy is proved to be
an index policy which at each time plays the arm with the highest Gittins index. Whittle introduced the
restless version of the bandit problem in [14]. The restless bandit problem does not have a known general
solution though special cases may be solved. For instance, a myopic policy is shown to be optimal when
channels are identical and bursty in [15] for an OSA problem formulated as a restless bandit problem
with each channel modeled as a two-state Markov chain (the Gilbert-Elliot model).

In this paper we first study the rested bandit problem with Markovian rewards. Specifically, we show
that a straightforward extension of the UCB1 algorithm [7] to the multiple play case (UCB1 was originally
designed for the case of a single play: M = 1) results in logarithmic regret for rested bandits with
Markovian rewards. We then use the key difference between rested and restless bandits to construct a
regenerative cycle algorithm (RCA) that produces logarithmic regret for the restless bandit problem. The 
construction of this algorithm allows us to use the proof of the rested problem as a natural stepping 
stone, and simplifies the presentation of the main conceptual idea. 

The work presented in this paper extends our previous results [8], [16] on single play to multiple
plays (M > 1). Note that this single player model with multiple plays at each time step is conceptually
equivalent to centralized (coordinated) learning by multiple players, each playing a single arm at
each time step. Indeed our proof takes this latter point of view for ease of exposition, and our results on
logarithmic regret apply equally to both cases.

The remainder of this paper is organized as follows. In Section II we present the problem formulation.
In Section III we analyze a sample mean based algorithm for the rested bandit problem. In Section IV we
propose an algorithm based on regenerative cycles that employs sample mean based indices and analyze
its regret. In Section V we numerically examine the performance of this algorithm in the case of an
OSA problem with Gilbert-Elliot channel model. In Section VI we discuss possible improvements and
compare our algorithm to other algorithms. Section VII concludes the paper.

II. Problem Formulation and Preliminaries 

Consider K arms (or channels) indexed by the set $\mathcal{K} = \{1, 2, \ldots, K\}$. The i-th arm is modeled as a
discrete-time, irreducible and aperiodic Markov chain with finite state space $S^i$. There is a stationary and
positive reward associated with each state of each arm. Let $r^i_x$ denote the reward obtained from state x
of arm i, $x \in S^i$; this reward is in general different for different states. Let $P^i = \{p^i_{xy}, x, y \in S^i\}$ denote
the transition probability matrix of the i-th arm, and $\pi^i = \{\pi^i_x, x \in S^i\}$ the stationary distribution of $P^i$.

We assume the arms (the Markov chains) are mutually independent. In subsequent sections we will
consider the rested and the restless cases separately. As mentioned in the introduction, the state of a
rested arm changes according to $P^i$ only when it is played and remains frozen otherwise. By contrast,
the state of a restless arm changes according to $P^i$ regardless of the user's actions. All the assumptions in
this section apply to both types of arms. We note that the rested model is a special case of the restless
model, but our development under the restless model follows the rested model$^1$.

Let $(P^i)'$ denote the adjoint of $P^i$ on $l_2(\pi)$, where

$$(p^i)'_{xy} = \frac{\pi^i_y p^i_{yx}}{\pi^i_x}, \quad \forall x, y \in S^i,$$

and $\hat{P}^i = (P^i)' P^i$ denotes the multiplicative symmetrization of $P^i$. We will assume that the $P^i$'s are such
that the $\hat{P}^i$'s are irreducible. To give a sense of how weak or strong this assumption is, we first note that
this is a weaker condition than assuming the Markov chains to be reversible. In addition, we note that
one condition that guarantees the $\hat{P}^i$'s are irreducible is $p^i_{xx} > 0$, $\forall x \in S^i$, $\forall i$. This assumption thus holds
naturally for our main motivating application, as it's possible for the channel condition to remain the same
over a single time step (especially if the unit is sufficiently small). It also holds for a very large class
of Markov chains and applications in general. Consider for instance a queueing system scenario where
an arm denotes a server and the Markov chain models its queue length, in which it is possible for the
queue length to remain the same over one time unit.
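
As a rough illustration of this assumption, the sketch below builds the multiplicative symmetrization of a transition matrix and checks its irreducibility. The helper names and both example matrices are hypothetical; the adjoint uses the standard entries $\pi_y p_{yx} / \pi_x$, and the stationary distribution is only approximated by power iteration.

```python
def stationary(p, iters=5000):
    """Approximate the stationary distribution by power iteration from the uniform vector."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * p[x][y] for x in range(n)) for y in range(n)]
    return pi


def multiplicative_symmetrization(p):
    """Return P_hat = P' P, where P'[x][y] = pi[y] * p[y][x] / pi[x] is the adjoint on l2(pi)."""
    n = len(p)
    pi = stationary(p)
    adj = [[pi[y] * p[y][x] / pi[x] for y in range(n)] for x in range(n)]
    return [[sum(adj[x][z] * p[z][y] for z in range(n)) for y in range(n)] for x in range(n)]


def is_irreducible(m, tol=1e-12):
    """True if every state can reach every other state through entries larger than tol."""
    n = len(m)
    for start in range(n):
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in range(n):
                if m[x][y] > tol and y not in seen:
                    seen.add(y)
                    stack.append(y)
        if len(seen) < n:
            return False
    return True


# Hypothetical 3-state arm with p_xx > 0 for every state (assumption for illustration).
LAZY = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5]]
# Deterministic cycle: irreducible, but p_xx = 0 (it is also periodic, so outside the model).
CYCLE = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0]]

print(is_irreducible(multiplicative_symmetrization(LAZY)))
print(is_irreducible(multiplicative_symmetrization(CYCLE)))
```

The cycle is included only to show how the symmetrization can lose irreducibility when no state can hold: there it collapses to the identity matrix.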

The mean reward of arm i, denoted by $\mu^i$, is the expected reward of arm i under its stationary
distribution:

$$\mu^i = \sum_{x \in S^i} r^i_x \pi^i_x. \quad (1)$$
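
A minimal sketch of this computation for a two-state (Gilbert-Elliot style) arm follows. The function names, transition probabilities and rewards are assumptions chosen for illustration; the closed form for the stationary distribution of a two-state chain is standard.

```python
def stationary_two_state(p01, p10):
    """Stationary distribution (pi_0, pi_1) of a two-state chain with
    transition probabilities p01 (0 -> 1) and p10 (1 -> 0)."""
    total = p01 + p10
    return (p10 / total, p01 / total)


def mean_reward(pi, rewards):
    """Mean reward under the stationary distribution: sum over states of r_x * pi_x."""
    return sum(r * w for r, w in zip(rewards, pi))


# Hypothetical channel: reward 0.1 in the bad state, 1.0 in the good state.
pi = stationary_two_state(p01=0.1, p10=0.2)
print(pi, mean_reward(pi, (0.1, 1.0)))
```

Ranking arms by this quantity gives the ordering of mean rewards used in the rest of the analysis.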

Consistent with the discrete time Markov chain model, we will assume that the player's actions occur 
in discrete time steps. Time is indexed by t, t = 1, 2, .... We will also frequently refer to the time
interval (t - 1, t] as time slot t. The player plays M of the K arms at each time step.

Throughout the analysis we will make the additional assumption that the mean reward of arm M is
strictly greater than the mean reward of arm M + 1, i.e., we have $\mu^1 \geq \mu^2 \geq \cdots \geq \mu^M > \mu^{M+1} \geq
\cdots \geq \mu^K$. For rested arms this assumption simplifies the presentation and is not necessary, i.e., results
will hold for $\mu^M \geq \mu^{M+1}$. However, for restless arms the strict inequality between $\mu^M$ and $\mu^{M+1}$ is
needed because otherwise there can be a large number of arm switchings between the M-th and the
(M + 1)-th arms (possibly more than logarithmic). Strict inequality will prevent this from happening.
We note that this assumption is not in general restrictive; in our motivating application distinct channel
conditions typically mean different data rates. Possible relaxation of this condition is given in Section VI.

$^1$ In general a restless arm may be given by two transition probability matrices, an active one ($P^i$) and a passive one ($Q^i$).
The first describes the state evolution when it is played and the second the state evolution when it is not played. When an arm
models channel variation, $P^i$ and $Q^i$ are in general assumed to be the same as the channel variation is uncontrolled. In the
context of online learning we shall see that the selection of $Q^i$ is irrelevant; indeed the arm does not even have to be Markovian
when it's in the passive mode. More is discussed in Section VI.

We will refer to the set of arms {1, 2, ..., M} as the M-best arms and say that each arm in this set is
optimal, while referring to the set {M + 1, M + 2, ..., K} as the M-worst arms and say that each arm
in this set is suboptimal.

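For a policy $\alpha$ we define its regret $R^{\alpha}(n)$ as the difference between the expected total reward that can
be obtained by only playing the M-best arms and the expected total reward obtained by policy $\alpha$ up to
time n. Let $A^{\alpha}(t)$ denote the set of arms selected by policy $\alpha$ at t, t = 1, 2, ..., and $x_{\alpha(t)}$ the state of
arm $\alpha(t) \in A^{\alpha}(t)$ at time t. Then we have

$$R^{\alpha}(n) = n \sum_{j=1}^{M} \mu^j - E^{\alpha}\left[ \sum_{t=1}^{n} \sum_{\alpha(t) \in A^{\alpha}(t)} r^{\alpha(t)}_{x_{\alpha(t)}} \right]. \quad (2)$$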
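
The regret above is an expectation. As a hedged illustration, the sketch below computes its sample counterpart for a single run: the benchmark $n \sum_{j=1}^{M} \mu^j$ minus the rewards actually collected. The function name and the numbers are hypothetical; averaging this quantity over many independent runs would estimate the expected regret.

```python
def realized_regret(mu, m, collected):
    """Benchmark reward of the M-best arms over len(collected) slots minus the collected reward.

    mu: mean rewards of all arms; m: number of plays per slot;
    collected: total reward gathered in each slot of one run.
    """
    best = sum(sorted(mu, reverse=True)[:m])
    return len(collected) * best - sum(collected)


# Hypothetical example with K = 3 arms, M = 2 plays per slot, n = 3 slots.
print(realized_regret([0.9, 0.5, 0.2], 2, [1.4, 1.1, 0.7]))
```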

The objective is to examine how the regret $R^{\alpha}(n)$ behaves as a function of n for a given policy $\alpha$ and
to construct a policy whose regret is order-optimal, through appropriate bounding. As we will show and
as is commonly done, the key to bounding $R^{\alpha}(n)$ is to bound the expected number of plays of any
suboptimal arm.

Our analysis utilizes the following known results on Markov chains; the proofs are not reproduced
here for brevity. The first result is due to Lezaud [17] and bounds the probability of a large deviation
from the stationary distribution.

Lemma 1: [Theorem 3.3 from [17]] Consider a finite-state, irreducible Markov chain $\{X_t\}_{t \geq 1}$ with
state space S, matrix of transition probabilities P, an initial distribution q and stationary distribution $\pi$.
Let $N_q = \left\| \left( \frac{q_x}{\pi_x}, x \in S \right) \right\|_2$. Let $\hat{P} = P'P$ be the multiplicative symmetrization of P where P' is the
adjoint of P on $l_2(\pi)$. Let $\epsilon = 1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $\hat{P}$. $\epsilon$ will
be referred to as the eigenvalue gap of $\hat{P}$. Let $f : S \to \mathbb{R}$ be such that $\sum_{y \in S} \pi_y f(y) = 0$, $\|f\|_{\infty} \leq 1$
and $0 < \|f\|_2^2 \leq 1$. If $\hat{P}$ is irreducible, then for any positive integer n and all $0 < \gamma \leq 1$

$$P\left( \frac{\sum_{t=1}^{n} f(X_t)}{n} \geq \gamma \right) \leq N_q \exp\left[ -\frac{n \gamma^2 \epsilon}{28} \right].$$

The second result is due to Anantharam et al., which can be found in [5].

Lemma 2: [Lemma 2.1 from [5]] Let Y be an irreducible aperiodic Markov chain with a state space
S, transition probability matrix P, an initial distribution that is non-zero in all states, and a stationary
distribution $\{\pi_x\}$, $\forall x \in S$. Let $F_t$ be the $\sigma$-field generated by random variables $X_1, X_2, \ldots, X_t$ where
$X_t$ corresponds to the state of the chain at time t. Let G be a $\sigma$-field independent of $F = \vee_{t \geq 1} F_t$, the
smallest $\sigma$-field containing $F_1, F_2, \ldots$. Let $\tau$ be a stopping time with respect to the increasing family of
$\sigma$-fields $\{G \vee F_t, t \geq 1\}$. Define $N(x, \tau)$ such that

$$N(x, \tau) = \sum_{t=1}^{\tau} I(X_t = x).$$

Then $\forall \tau$ such that $E[\tau] < \infty$, we have

$$\left| E[N(x, \tau)] - \pi_x E[\tau] \right| \leq C_P, \quad (3)$$

where $C_P$ is a constant that depends on P.

The third result is due to Bremaud, which can be found in [18].

Lemma 3: If $\{X_n\}_{n \geq 0}$ is a positive recurrent homogeneous Markov chain with state space S, stationary
distribution $\pi$ and $\tau$ is a stopping time that is finite almost surely for which $X_{\tau} = x$, then for all $y \in S$

$$E\left[ \sum_{t=0}^{\tau - 1} I(X_t = y) \,\Big|\, X_0 = x \right] = E[\tau \mid X_0 = x] \, \pi_y.$$



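The following notations are frequently used throughout the paper: $\beta = \sum_{t=1}^{\infty} 1/t^2$, $\pi^i_{\min} = \min_{x \in S^i} \pi^i_x$,
$\pi_{\min} = \min_{i \in \mathcal{K}} \pi^i_{\min}$, $r_{\max} = \max_{x \in S^i, i \in \mathcal{K}} r^i_x$, $S_{\max} = \max_{i \in \mathcal{K}} |S^i|$,
$\hat{\pi}_{\max} = \max_{x \in S^i, i \in \mathcal{K}} \{\pi^i_x, 1 - \pi^i_x\}$,
$\epsilon_{\min} = \min_{i \in \mathcal{K}} \epsilon^i$, where $\epsilon^i$ is the eigenvalue gap (the difference between 1 and the second largest
eigenvalue) of the multiplicative symmetrization of the transition probability matrix of the i-th arm, and
$\Omega^i_{\max} = \max_{x, y \in S^i} \Omega^i_{x,y}$, where $\Omega^i_{x,y}$ is the mean hitting time of state y given the initial state x for arm
i for $P^i$.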

In the next two sections we present algorithms for the rested and restless problems, referred to as the 
upper confidence bound - multiple plays (UCB-M) and the regenerative cycle algorithm - multiple plays 
(RCA-M), respectively, and analyze their regret. 

III. Analysis of the Rested Bandit Problem with Multiple Plays 

In this section we show that there exists an algorithm that achieves logarithmic regret uniformly
over time for the rested bandit problem with Markovian reward and multiple plays. We present such
an algorithm, called the upper confidence bound - multiple plays (UCB-M), which is a straightforward
extension of UCB1 from [7]. This algorithm plays M of the K arms with the highest indices with a
modified exploration constant L instead of 2 in [7]. Throughout our discussion, we will consider a horizon
of n time slots. For simplicity of presentation we will view a single player playing multiple arms at each
time as multiple coordinated players each playing a single arm at each time. In other words we consider
M players indexed by 1, 2, ..., M, each playing a single arm at a time. Since in this case information
is centralized, collision is completely avoided among the players, i.e., at each time step an arm will be
played by at most one player.

Below we summarize a list of notations used in this section.

• $A(t)$: the set of arms played at time t (or in slot t).

• $T^i(t)$: total number of times (slots) arm i is played up to the end of slot t.

• $T^{i,j}(t)$: total number of times (slots) player j played arm i up to the end of slot t.

• $\bar{r}^i(T^i(t))$: sample mean of the rewards observed from the first $T^i(t)$ plays of arm i.

As shown in Figure 1, UCB-M selects the M arms with the highest indices at each time step and
updates the indices according to the rewards observed. The index given on line 4 of Figure 1 depends on
the sample mean reward and an exploration term which reflects the relative uncertainty about the sample
mean of an arm. We call L in the exploration term the exploration constant. The exploration term grows
logarithmically when the arm is not played in order to guarantee that sufficient samples are taken from
each arm to approximate the mean reward.

The Upper Confidence Bound - Multiple Plays (UCB-M):
1: Initialize: Play each arm M times in the first K slots
2: while $t \geq K$ do
3: $\bar{r}^i(T^i(t)) = \frac{r^i(1) + r^i(2) + \ldots + r^i(T^i(t))}{T^i(t)}$, $\forall i$
4: calculate index: $g^i_{t, T^i(t)} = \bar{r}^i(T^i(t)) + \sqrt{\frac{L \ln t}{T^i(t)}}$, $\forall i$
5: t := t + 1
6: play M arms with the highest indices, update $r^i(t)$ and $T^i(t)$.
7: end while

Fig. 1. Pseudocode for the UCB-M algorithm.
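
The pseudocode in Figure 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class `RestedArm`, the function `ucb_m`, the round-robin initialization order, and the example arms and value of L are all hypothetical. In particular the value of L used here is arbitrary and is not chosen to satisfy the sufficient condition required by the analysis.

```python
import math
import random


class RestedArm:
    """Markov arm whose state advances only when the arm is played."""

    def __init__(self, p, rewards, rng):
        self.p, self.rewards, self.rng = p, rewards, rng
        self.state = rng.randrange(len(p))

    def play(self):
        reward = self.rewards[self.state]  # reward of the state the arm is in when played
        self.state = self.rng.choices(range(len(self.p)), weights=self.p[self.state])[0]
        return reward


def ucb_m(arms, m, horizon, L):
    """Run UCB-M for `horizon` slots; return (plays per arm, reward sum per arm)."""
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: player j plays arm (t - 1 + j) mod K, so that after
            # K slots every arm has been played M times.
            chosen = [(t - 1 + j) % k for j in range(m)]
        else:
            # Index = sample mean + exploration term, using the t - 1 elapsed slots.
            index = [sums[i] / counts[i] + math.sqrt(L * math.log(t - 1) / counts[i])
                     for i in range(k)]
            chosen = sorted(range(k), key=lambda i: index[i], reverse=True)[:m]
        for i in chosen:
            sums[i] += arms[i].play()
            counts[i] += 1
    return counts, sums


# Hypothetical example: three two-state arms, M = 2 plays per slot.
rng = random.Random(0)
example_arms = [
    RestedArm([[0.5, 0.5], [0.1, 0.9]], (0.1, 1.0), rng),
    RestedArm([[0.7, 0.3], [0.3, 0.7]], (0.1, 1.0), rng),
    RestedArm([[0.9, 0.1], [0.5, 0.5]], (0.1, 1.0), rng),
]
counts, sums = ucb_m(example_arms, m=2, horizon=2000, L=2.0)
print(counts)
```

Because the arms are rested, only the arms in `chosen` change state in a slot; the unplayed arm keeps its state until its next play.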

To upper bound the regret of the above algorithm logarithmically, we proceed as follows. We begin 
by relating the regret to the expected number of plays of the arms and then show that each suboptimal 
arm is played at most logarithmically in expectation. These steps are illustrated in the following lemmas. 
Most of these lemmas are established under the following condition on the arms.
Condition 1: All arms are finite-state, irreducible, aperiodic Markov chains whose transition probability
matrices have irreducible multiplicative symmetrizations and $r^i_x > 0$, $\forall i \in \mathcal{K}$, $\forall x \in S^i$.

Lemma 4: Assume that all arms are finite-state, irreducible, aperiodic, rested Markov chains. Then
using UCB-M we have:

$$\left| R(n) - \left( n \sum_{j=1}^{M} \mu^j - \sum_{i=1}^{K} \mu^i E[T^i(n)] \right) \right| \leq C_{S,P,r}, \quad (4)$$

where $C_{S,P,r}$ is a constant that depends on the state spaces, rewards, and transition probabilities but not
on time.

Proof: see Appendix A. ■
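Lemma 5: Assume Condition 1 holds and all arms are rested. Under UCB-M with $L \geq 112 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max} / \epsilon_{\min}$,
for any suboptimal arm i, we have

$$E[T^i(n)] \leq M + \frac{4L \ln n}{(\mu^M - \mu^i)^2} + \sum_{j=1}^{M} \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}.$$

Proof: see Appendix C. ■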
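Theorem 1: Assume Condition 1 holds and all arms are rested. With constant $L \geq 112 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max} / \epsilon_{\min}$
the regret of UCB-M is upper bounded by

$$R(n) \leq 4L \ln n \sum_{i>M} \frac{\mu^1 - \mu^i}{(\mu^M - \mu^i)^2} + \sum_{i>M} (\mu^1 - \mu^i) \left( M + \sum_{j=1}^{M} C_{i,j} \right) + C_{S,P,r}, \quad (5)$$

where $C_{i,j} = \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}$.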
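Proof: Each player plays exactly one arm in every slot, so $\sum_{i=1}^{K} T^{i,j}(n) = n$ for every player j, and
$T^i(n) = \sum_{j=1}^{M} T^{i,j}(n)$. Therefore

$$n \sum_{j=1}^{M} \mu^j - \sum_{i=1}^{K} \mu^i E[T^i(n)]
= \sum_{j=1}^{M} \sum_{i=1}^{K} \mu^j E[T^{i,j}(n)] - \sum_{j=1}^{M} \sum_{i=1}^{K} \mu^i E[T^{i,j}(n)]$$

$$\leq \sum_{j=1}^{M} \sum_{i>M} (\mu^1 - \mu^i) E[T^{i,j}(n)] = \sum_{i>M} (\mu^1 - \mu^i) E[T^i(n)],$$

where the inequality uses that each arm is played at most once per slot and that $\mu^j \leq \mu^1$ for all j.

Thus,

$$R(n) \leq n \sum_{j=1}^{M} \mu^j - \sum_{i=1}^{K} \mu^i E[T^i(n)] + C_{S,P,r} \quad (6)$$

$$\leq \sum_{i>M} (\mu^1 - \mu^i) E[T^i(n)] + C_{S,P,r}$$

$$\leq \sum_{i>M} (\mu^1 - \mu^i) \left( M + \frac{4L \ln n}{(\mu^M - \mu^i)^2} + \sum_{j=1}^{M} \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}} \right) + C_{S,P,r} \quad (7)$$

$$= 4L \ln n \sum_{i>M} \frac{\mu^1 - \mu^i}{(\mu^M - \mu^i)^2} + \sum_{i>M} (\mu^1 - \mu^i) \left( M + \sum_{j=1}^{M} C_{i,j} \right) + C_{S,P,r},$$

where (6) follows from Lemma 4 and (7) follows from Lemma 5. ■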
The above theorem says that provided that L satisfies the stated sufficient condition, UCB-M results
in logarithmic regret for the rested problem. This sufficient condition does require some knowledge of
the underlying Markov chains. This requirement may be removed if the value of L is adapted over time.
More is discussed in Section VI.

IV. Analysis of the Restless Bandit Problem with Multiple Plays 

In this section we study the restless bandit problem. We construct an algorithm called the regenerative 
cycle algorithm - multiple plays (RCA-M), and prove that this algorithm guarantees logarithmic regret 
uniformly over time under the same mild assumptions on the state transition probabilities as in the rested 
case. RCA-M is a multiple plays extension of RCA first introduced in [16]. Below we first present the key
conceptual idea behind RCA-M, followed by a more detailed pseudocode. We then prove the logarithmic 
regret result. 

As the name suggests, RCA-M operates in regenerative cycles. In essence RCA-M uses the observations
from sample paths within regenerative cycles to estimate the sample mean of an arm in the form of an
index similar to that used in UCB-M, while discarding the rest of the observations for the purpose of
computing the index. The rewards from the discarded observations are still collected and added to the
total reward; they are simply not used to make decisions. The reason behind such a construction has to do with the
restless nature of the arms. Since each arm continues to evolve according to the Markov chain regardless
of the user's action, the probability distribution of the reward we get by playing an arm is a function
of the amount of time that has elapsed since the last time we played the same arm. Since the arms are
not played continuously, the sequence of observations from an arm which is not played consecutively
does not correspond to a discrete time homogeneous Markov chain. While this certainly does not affect
our ability to collect rewards, it becomes hard to analyze the estimated quality (the index) of an arm
calculated based on rewards collected this way.

However, if instead of the actual sample path of observations from an arm, we limit ourselves to a 
sample path constructed (or rather stitched together) using only the observations from regenerative cycles, 
then this sample path essentially has the same statistics as the original Markov chain due to the renewal 
property and one can now use the sample mean of the rewards from the regenerative sample paths to 
approximate the mean reward under stationary distribution. 

Under RCA-M each player maintains a block structure; a block consists of a certain number of slots.
Recall that as mentioned earlier, even though our basic model is one of single-player multiple-play, our
description is in the equivalent form of multiple coordinated players each with a single play. Within a
block a player plays the same arm continuously till a certain pre-specified state (say $\gamma^i$) is observed.
Upon this observation the arm enters a regenerative cycle and the player continues to play the same arm
till state $\gamma^i$ is observed for the second time, which denotes the end of the block. Since M arms are
played (by M players) simultaneously in each slot, different blocks overlap in time. Multiple blocks may
or may not start or end at the same time. In our analysis below blocks will be ordered; they are ordered
according to their start time. If multiple blocks start at the same time then the ordering among them is
randomly chosen.

For the purpose of index computation and subsequent analysis, each block is further broken into three
sub-blocks (SBs). SB1 consists of all time slots from the beginning of the block to right before the first
visit to $\gamma^i$; SB2 includes all time slots from the first visit to $\gamma^i$ up to but excluding the second visit
to state $\gamma^i$; SB3 consists of a single time slot with the second visit to $\gamma^i$. Figure 2 shows an example
sample path of the operation of RCA-M. The block structure of two players is shown in this example;
the ordering of the blocks is also shown.
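
The sub-block decomposition can be illustrated with a short sketch. The function name `split_block` and the example state sequences are hypothetical; the input is assumed to be the list of states observed during one completed block, so it ends at the second visit to the regenerative state.

```python
def split_block(states, gamma):
    """Split the states observed in one completed block into (SB1, SB2, SB3).

    SB1: slots before the first visit to gamma (empty if the block starts in gamma).
    SB2: slots from the first visit to gamma up to, but excluding, the second visit.
    SB3: the single slot of the second visit to gamma.
    """
    if states.count(gamma) != 2 or states[-1] != gamma:
        raise ValueError("a completed block ends at the second visit to gamma")
    first = states.index(gamma)
    return states[:first], states[first:-1], states[-1:]


# Hypothetical block with regenerative state gamma = 1.
print(split_block([2, 0, 1, 0, 2, 1], gamma=1))
```

Only the middle component would feed the index of the arm; the rewards from the other two components are still collected.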

The key to the RCA-M algorithm is for each arm to single out only observations within SB2's in each
block and virtually assemble them. Throughout our discussion, we will consider a horizon of n time
slots. A list of notations used is summarized as follows:

• $A(t)$: the set of arms played at time t (or in time slot t).

• $\gamma^i$: the state that determines the regenerative cycles for arm i.

• $\alpha(b)$: the arm played in the b-th block.

• $b(n)$: the total number of completed blocks by all players up to time n.

• $T(n)$: the time at the end of the last completed block across all arms (see Figure 2).

• $T^i(n)$: the total number of times (slots) arm i is played up to the last completed block of arm i up
to time $T(n)$.

• $T^{i,j}(n)$: the total number of times (slots) arm i is played by user j up to the last completed block
of arm i up to time $T(n)$.

• $B^i(b)$: the total number of blocks within the first completed b blocks in which arm i is played.

• $X^i_1(b)$: the vector of observed states from SB1 of the b-th block in which arm i is played; this vector
is empty if the first observed state is $\gamma^i$.

• $X^i_2(b)$: the vector of observed states from SB2 of the b-th block in which arm i is played.

• $X^i(b)$: the vector of observed states from the b-th block in which arm i is played. Thus we have
$X^i(b) = [X^i_1(b), X^i_2(b), \gamma^i]$.

• $t(b)$: time at the end of block b.

• $T^i(t(b))$: the total number of time slots arm i is played up to the last completed block of arm i
within time $t(b)$.

• $t_2(b)$: the total number of time slots that lie within at least one SB2 in a completed block of any
arm up to and including block b.

• $r^i(t)$: the reward from arm i upon its t-th play, counting only those plays during an SB2.

• $T^i_2(t_2(b))$: the total number of time slots arm i is played during SB2's up to and including block b.

• $O(b)$: the set of arms that are free to be selected by a player upon its completion of the b-th
block; these are arms that are currently not being played by other players (during time slot $t(b)$),
and the arms whose blocks are completed at time $t(b)$.

Fig. 2. Example realization of RCA-M with M = 2 for a period of n slots.

RCA-M computes and updates the value of an index $g^i$ for each arm i in the set $O(b)$ at the end of
block b based on the total reward obtained from arm i during all SB2's as follows:

$$g^i_{t_2(b), T^i_2(t_2(b))} = \bar{r}^i(T^i_2(t_2(b))) + \sqrt{\frac{L \ln t_2(b)}{T^i_2(t_2(b))}},$$

where L is a constant, and

$$\bar{r}^i(T^i_2(t_2(b))) = \frac{r^i(1) + r^i(2) + \ldots + r^i(T^i_2(t_2(b)))}{T^i_2(t_2(b))}$$

denotes the sample mean of the reward collected during SB2. Note that this is the same way the index
is computed under UCB-M if we only consider SB2's. It's also worth noting that under RCA-M rewards
are also collected during SB1's and SB3's. However, the computation of the indices only relies on SB2.
The pseudocode of RCA-M is given in Figure 3.

Due to the regenerative nature of the Markov chains, the rewards used in the computation of the index 
of an arm can be viewed as rewards from a rested arm with the same transition matrix as the active 
transition matrix of the restless arm. However, to prove the existence of a logarithmic upper bound on 
the regret for restless arms remains a non-trivial task since the blocks may be arbitrarily long and the 
frequency of arm selection depends on the length of the blocks. 

In the analysis that follows, we first show that the expected number of blocks in which a suboptimal
arm is played is at most logarithmic in time, by applying the result in Lemma 7 that compares the indices of arms
in slots where an arm is selected. Since the arms are irreducible, the
expected block length is finite, and thus the expected number of time slots in which a suboptimal arm is played is
also at most logarithmic. Finally, we show that the regret due to arm switching is at most logarithmic.

We bound the expected number of plays from a suboptimal arm. 

Lemma 6: Assume Condition 1 holds and all arms are restless. Under RCA-M with a constant
$L \geq 112 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max} / \epsilon_{\min}$, we have

$$\sum_{i>M} (\mu^1 - \mu^i) E[T^i(n)] \leq 4L \sum_{i>M} \frac{(\mu^1 - \mu^i) D_i \ln n}{(\mu^M - \mu^i)^2} + \sum_{i>M} (\mu^1 - \mu^i) D_i \left( 1 + M \sum_{j=1}^{M} C_{i,j} \right),$$

where

$$C_{i,j} = \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}, \quad D_i = \left( \frac{1}{\pi^i_{\min}} + \Omega^i_{\max} + 1 \right).$$

Proof: see Appendix D. ■

The Regenerative Cycle Algorithm - Multiple Plays (RCA-M):
1: Initialize: b = 1, t = 0, $t_2 = 0$, $T^i_2 = 0$, $r^i = 0$, $I^i_{SB2} = 0$, $I^i_{IN} = 1$, $\forall i = 1, \ldots, K$, $A = \emptyset$
2: //$I^i_{IN}$ indicates whether arm i has been played at least once
3: //$I^i_{SB2}$ indicates whether arm i is in an SB2 sub-block
4: while (1) do
5:   for i = 1 to K do
6:     if $I^i_{IN} = 1$ and $|A| < M$ then
7:       $A \leftarrow A \cup \{i\}$ //arms never played are given priority to ensure all arms are sampled initially
8:     end if
9:   end for
10:  if $|A| < M$ then
11:    Add to A the set $\{i : g^i$ is one of the $M - |A|$ largest among $\{g^k, k \in \mathcal{K} - A\}\}$
12:    //for arms that have been played at least once, those with the largest indices are selected
13:  end if
14:  for $i \in A$ do
15:    play arm i; denote state observed by $x^i$
16:    if $I^i_{IN} = 1$ then
17:      $\gamma^i = x^i$, $T^i_2 := T^i_2 + 1$, $r^i := r^i + r^i_{x^i}$, $I^i_{IN} = 0$, $I^i_{SB2} = 1$
18:      //the first observed state becomes the regenerative state; the arm enters SB2
19:    else if $x^i \neq \gamma^i$ and $I^i_{SB2} = 1$ then
20:      $T^i_2 := T^i_2 + 1$, $r^i := r^i + r^i_{x^i}$
21:    else if $x^i = \gamma^i$ and $I^i_{SB2} = 0$ then
22:      $T^i_2 := T^i_2 + 1$, $r^i := r^i + r^i_{x^i}$, $I^i_{SB2} = 1$
23:    else if $x^i = \gamma^i$ and $I^i_{SB2} = 1$ then
24:      $I^i_{SB2} = 0$, $A \leftarrow A - \{i\}$ //second visit to $\gamma^i$ (SB3): the block ends; this reward is collected but not used in the index
25:    end if
26:  end for
27:  t := t + 1, $t_2 := t_2 + \min\{1, \sum_{i \in \mathcal{K}} I^i_{SB2}\}$ //$t_2$ is only accumulated if at least one arm is in SB2
28:  for i = 1 to K do
29:    $g^i = \frac{r^i}{T^i_2} + \sqrt{\frac{L \ln t_2}{T^i_2}}$
30:  end for
31: end while

Fig. 3. Pseudocode of RCA-M
We now state the main result of this section. 

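Theorem 2: Assume Condition 1 holds and all arms are restless. With constant
$L > 112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, the regret of RCA-M is upper bounded by

$$R(n) \leq 4L \ln n \sum_{i>M} \frac{1}{(\mu^M - \mu^i)^2} \left((\mu^1 - \mu^i) D_i + E_i\right) + \sum_{i>M} \left((\mu^1 - \mu^i) D_i + E_i\right) \left(1 + M \sum_{j=1}^{M} C_{i,j}\right) + F,$$

where

$$C_{i,j} = \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}, \quad \beta = \sum_{t=1}^{\infty} t^{-2}, \quad D_i = \left(\frac{1}{\pi^i_{\min}} + M^i_{\max} + 1\right),$$

$$E_i = \mu^i (1 + M^i_{\max}) + \sum_{j=1}^{M} \mu^j M^j_{\max}, \quad F = \sum_{j=1}^{M} \mu^j \left(\frac{1}{\pi_{\min}} + \max_{i \in \{1, \ldots, K\}} M^i_{\max} + 1\right).$$

Proof: see Appendix F. ■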
Theorem 2 suggests that given minimal information about the arms, such as an upper bound for
$S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, the player can guarantee logarithmic regret by choosing an L in RCA-M that satisfies
the stated condition. As in the rested case, this requirement on L can be completely removed if the value
of L is adapted over time; more is discussed in Section VI.

We conjecture that the order optimality of RCA-M holds when it is used with any index policy that 
is order optimal for the rested bandit problem. Because of the use of regenerative cycles in RCA-M, the 
observations used to calculate the indices can be in effect treated as coming from rested arms. Thus an 
approach similar to the one used in the proof of Theorem 2 can be used to prove order optimality of
combinations of RCA-M and other index policies. 

V. An Example for OSA: Gilbert-Elliot Channel Model 

In this section we simulate RCA-M under the commonly used Gilbert-Elliot channel model where each
channel has two states, good and bad (or 1, 0, respectively). We assume that channel state transitions are
caused by primary user activity, therefore the problem reduces to the OSA problem. For any channel i,
$r^i_1 = 1$, $r^i_0 = 0.1$. We simulate RCA-M in four environments with different state transition probabilities.
We compute the normalized regret values, i.e., the regret per single play $R(n)/M$, by averaging the results
of 100 runs.

The state transition probabilities are given in Table I and the mean rewards of the channels under
these state transition probabilities are given in Table II. The four environments, denoted as S1, S2, S3
and S4, respectively, are summarized as follows. In S1 channels are bursty with mean rewards not close
to each other; in S2 channels are non-bursty with mean rewards not close to each other; in S3 there are
bursty and non-bursty channels with mean rewards not close to each other; and in S4 there are bursty
and non-bursty channels with mean rewards close to each other.

In Figures 4, 6, 8, 10 we observe the normalized regret of RCA-M for the minimum values of L such
that the logarithmic bound holds. However, comparing with Figures 5, 7, 9, 11 we see that the normalized
regret is smaller for L = 1. Therefore the condition on L we have for the logarithmic bound, while
sufficient, does not appear necessary. We also observe that for the Gilbert-Elliot channel model the regret
can be smaller when L is set to a value smaller than $112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$.
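The simulation setup described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce the figures: the function name `simulate_rca_m`, the stationary initial states, the fixed random seed, and the tie-breaking order are all choices made for the sketch. It follows the structure of Figure 3, except that the closing visit to the regenerative state is left out of the reward sum as well as out of the sample count, in line with the description of the regenerative cycles in Appendix D.

```python
import math
import random

# S2 environment from Table I (non-bursty channels).
P01_S2 = [0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
P10_S2 = [0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]


def simulate_rca_m(p01, p10, M, L, horizon, r0=0.1, r1=1.0, seed=0):
    """Sketch of RCA-M on two-state channels; returns (total reward, plays per arm)."""
    rng = random.Random(seed)
    K = len(p01)
    reward = (r0, r1)
    # Assumption of this sketch: channels start in their stationary distribution.
    state = [int(rng.random() < p01[i] / (p01[i] + p10[i])) for i in range(K)]
    gamma = [None] * K         # regenerative state of each arm
    never_played = [True] * K  # role of I_IN in Figure 3
    in_sb2 = [False] * K       # role of I_SB2 in Figure 3
    T2 = [0] * K               # samples used in the index
    rsum = [0.0] * K           # reward summed over those samples
    g = [0.0] * K              # index values
    A = set()                  # arms whose block is still running
    t2 = 0
    total = 0.0
    plays = [0] * K
    for _ in range(horizon):
        # Arms never played get priority; the largest indices fill the rest of A.
        for i in range(K):
            if never_played[i] and len(A) < M:
                A.add(i)
        if len(A) < M:
            rest = sorted((i for i in range(K) if i not in A),
                          key=lambda i: g[i], reverse=True)
            A.update(rest[:M - len(A)])
        for i in sorted(A):
            x = state[i]
            total += reward[x]
            plays[i] += 1
            if never_played[i]:
                # First observation becomes the regenerative state; SB2 starts.
                gamma[i], never_played[i], in_sb2[i] = x, False, True
                T2[i] += 1
                rsum[i] += reward[x]
            elif in_sb2[i] and x != gamma[i]:
                T2[i] += 1
                rsum[i] += reward[x]
            elif in_sb2[i]:
                # Return to the regenerative state: the block ends here.
                in_sb2[i] = False
                A.discard(i)
            elif x == gamma[i]:
                # End of SB1: the regenerative state is hit and SB2 starts.
                in_sb2[i] = True
                T2[i] += 1
                rsum[i] += reward[x]
            # Otherwise the arm is still in SB1 and nothing is recorded.
        t2 += min(1, sum(in_sb2))
        for i in range(K):
            if T2[i] > 0 and t2 > 0:
                g[i] = rsum[i] / T2[i] + math.sqrt(L * math.log(t2) / T2[i])
        # Restless arms: every channel changes state in every slot.
        for i in range(K):
            if rng.random() < (p01[i] if state[i] == 0 else p10[i]):
                state[i] = 1 - state[i]
    return total, plays
```

The regret per play can then be estimated as the horizon times the sum of the M largest mean rewards, minus the returned total, divided by M, and averaged over independent seeds.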



channel       1     2     3     4     5     6     7     8     9     10
S1, p_{01}  0.01  0.01  0.02  0.02  0.03  0.03  0.04  0.04  0.05  0.05
S1, p_{10}  0.08  0.07  0.08  0.07  0.08  0.07  0.02  0.01  0.02  0.01
S2, p_{01}  0.1   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
S2, p_{10}  0.9   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
S3, p_{01}  0.01  0.1   0.02  0.3   0.04  0.5   0.06  0.7   0.08  0.9
S3, p_{10}  0.09  0.9   0.08  0.7   0.06  0.5   0.04  0.3   0.02  0.1
S4, p_{01}  0.02  0.04  0.04  0.5   0.06  0.05  0.7   0.8   0.9   0.9
S4, p_{10}  0.03  0.03  0.04  0.4   0.05  0.06  0.6   0.7   0.8   0.9

TABLE I
Transition probabilities



channel    1      2      3      4      5      6      7      8      9      10
S1       0.20   0.21   0.28   0.30   0.35   0.37   0.70   0.82   0.74   0.85
S2       0.19   0.19   0.28   0.37   0.46   0.55   0.64   0.73   0.82   0.91
S3       0.19   0.19   0.28   0.37   0.46   0.55   0.64   0.73   0.82   0.91
S4       0.460  0.614  0.550  0.600  0.591  0.509  0.585  0.580  0.577  0.550

TABLE II
Mean rewards

Fig. 4. Normalized regret under S1, L = 7200
Fig. 5. Normalized regret under S1, L = 1
Fig. 6. Normalized regret under S2, L = 360
Fig. 7. Normalized regret under S2, L = 1
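The mean rewards in Table II follow from the transition probabilities in Table I: a two-state chain with transition probabilities $p_{01}$ and $p_{10}$ spends a fraction $p_{01}/(p_{01} + p_{10})$ of the time in the good state, and the mean reward is the stationary average of the rewards 1 and 0.1. The helper names below are illustrative only.

```python
def stationary_good_prob(p01, p10):
    """Stationary probability of the good state (state 1) of a two-state chain."""
    return p01 / (p01 + p10)


def mean_reward(p01, p10, r0=0.1, r1=1.0):
    """Stationary mean reward with reward r1 in the good state and r0 in the bad state."""
    pi1 = stationary_good_prob(p01, p10)
    return (1.0 - pi1) * r0 + pi1 * r1


# S1 rows of Table I and Table II.
S1_P01 = [0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.04, 0.05, 0.05]
S1_P10 = [0.08, 0.07, 0.08, 0.07, 0.08, 0.07, 0.02, 0.01, 0.02, 0.01]
S1_TABLE_II = [0.20, 0.21, 0.28, 0.30, 0.35, 0.37, 0.70, 0.82, 0.74, 0.85]

S1_MEANS = [mean_reward(a, b) for a, b in zip(S1_P01, S1_P10)]
```

For example, channel 1 under S2 has $p_{01} = 0.1$ and $p_{10} = 0.9$, so the good state has stationary probability 0.1 and the mean reward is $0.1 \cdot 1 + 0.9 \cdot 0.1 = 0.19$, as listed in Table II.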

VI. Discussion 

In this section we discuss how the performance of RCA-M may be improved (in terms of the constants 
and not in order), and possible relaxation and extensions. 

A. Applicability, Performance Improvement, and Relaxation 

We note that the same logarithmic bound derived in this paper holds for the general restless bandit
where the state evolution is given by two matrices: the active and passive transition probability matrices
($P^i$ and $Q^i$ respectively for arm i), which are potentially different. The addition of a different $Q^i$ does
not affect the analysis because the reward to the player from an arm is determined only by the active
transition probability matrix and the first state after a discontinuity in playing the arm. Since the number
of plays from any suboptimal arm is logarithmic and the expected hitting time of any state is finite, the
regret due to $Q^i$ is at most logarithmic. We further note that for the same reason the arm may not even
follow a Markovian rule in the passive state, and the same logarithmic bound will continue to hold.

The regenerative state for an arm under RCA-M is chosen based on the random initial observation.
This means that RCA-M may happen upon a state with long recurrence time which will result in long
SB1 and SB2 sub-blocks. We propose the following modification: RCA-M records all observations from
all arms. Let $k^i(s, t)$ be the total number of observations from arm i up to time t that are excluded from
the computation of the index of arm i when the regenerative state is s. Recall that the index of an arm is
computed based on observations from regenerative cycles; this implies that $k^i(s, t)$ is the total number of
slots in SB1's when the regenerative state is s. Let $t_n$ be the time at the end of the n-th block. If the arm
to be played in the n-th block is i then the regenerative state is set to $\gamma^i(n) = \arg\min_{s \in S^i} k^i(s, t_{n-1})$.
The idea behind this modification is to estimate the state with the smallest recurrence time and choose
the regenerative cycles according to this state. With this modification the number of observations that
do not contribute to the index computation and the probability of choosing a suboptimal arm can be
minimized over time.
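The proposed selection rule can be illustrated with a short sketch. It assumes that the recorded observations of an arm are stored as one list of states per block, and it counts as excluded every observation that precedes the first visit to the candidate state within a block (the whole block if the state is never visited). Both the storage layout and the function names are assumptions of this sketch.

```python
def excluded_count(blocks, s):
    """k(s): observations that fall before the first visit to s, summed over blocks."""
    total = 0
    for block in blocks:
        # Observations before the first visit to s would form the SB1 sub-block.
        total += block.index(s) if s in block else len(block)
    return total


def choose_regenerative_state(blocks, states):
    """Pick the state that would have excluded the fewest observations so far."""
    return min(states, key=lambda s: excluded_count(blocks, s))


# Recorded blocks of one two-state arm (illustrative data only).
recorded = [[0, 0, 1, 0], [1, 1, 0], [0, 1]]
best_state = choose_regenerative_state(recorded, states=[0, 1])
```

With the data above, state 0 would have excluded 2 observations and state 1 would have excluded 3, so state 0 is chosen for the next block.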

It is also worth noting that the selection of the regenerative state $\gamma^i$ in each block in general can be
arbitrary: within the same SB2, we can start and end in different states. As long as we guarantee that
two successive SB2's end and start with the same state, we will have a continuous sample path for which
our analysis in Section IV holds.



Fig. 8. Normalized regret under S3, L = 3600
Fig. 9. Normalized regret under S3, L = 1
Fig. 10. Normalized regret under S4, L = 7200
Fig. 11. Normalized regret under S4, L = 1

B. Relaxation of Certain Conditions 

We have noted in Section V that the condition on L, while sufficient, does not appear necessary for
the logarithmic regret bound to hold. Indeed our examples show that smaller regret can be achieved by
setting L = 1. Note that this condition on L originates from the large deviation bound by Lezaud given
in Lemma 1. This condition can be relaxed if we use a tighter large deviation bound.

We further note that even if no information is available on the underlying Markov chains to derive
this sufficient condition on L, an $o(\log(n) f(n))$ regret is achievable by letting L grow slowly with time,
where $f(n)$ is any increasing sequence. Such an approach has been used in other settings and algorithms,
see e.g., CB, H. 

We have noted earlier that the strict inequality $\mu^M > \mu^{M+1}$ is required for the restless multiarmed
bandit problem because in order to have logarithmic regret, we can have no more than a logarithmic
number of discontinuities from the optimal arms. When $\mu^M = \mu^{M+1}$ the rankings of the indices of arms
M and M + 1 can oscillate indefinitely, resulting in a large number of discontinuities. Below we briefly
discuss how to resolve this issue if indeed $\mu^M = \mu^{M+1}$. Consider adding a threshold $\epsilon$ to the algorithm
such that a new arm will be selected instead of an arm currently being played only if the index of that
arm is at least $\epsilon$ larger than the index of the currently played arm which has the smallest index among
all currently played arms. Then given that $\epsilon$ is sufficiently small (with respect to the differences of mean
rewards) indefinite switching between the M-th and the (M + 1)-th arms can be avoided. However, further
analysis is needed to verify that this approach will result in logarithmic regret.
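The threshold rule described above can be sketched as follows. This is only an illustration of the switching condition, with a hypothetical function name and data layout; it is not part of RCA-M as analyzed in this paper, and it says nothing about the resulting regret.

```python
def select_with_threshold(current, indices, eps):
    """Swap in a new arm only if its index beats the weakest played arm by at least eps."""
    current = list(current)
    # Arms not currently played, strongest index first.
    challengers = sorted((a for a in range(len(indices)) if a not in current),
                         key=lambda a: indices[a], reverse=True)
    for challenger in challengers:
        weakest = min(current, key=lambda a: indices[a])
        if indices[challenger] >= indices[weakest] + eps:
            current[current.index(weakest)] = challenger
        else:
            # Remaining challengers have even smaller indices.
            break
    return current


example_indices = [0.50, 0.52, 0.90, 0.30]
kept = select_with_threshold([0, 3], example_indices, eps=0.05)
```

In this example arm 2 replaces arm 3, while arm 1 does not replace arm 0 because its index exceeds that of arm 0 by only 0.02, which is below the threshold.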
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C. Definition of Regret 

We have used the weak regret measure throughout this paper, which compares the learning strategy
with the best single-action strategy. When the statistics are known a priori, it is clear that in general the
best one can do is not a single-action policy (in principle one can derive such a policy using dynamic
programming). Ideally one could try to adopt a regret measure with respect to this optimal policy.
However, such an optimal policy in the restless case is not known in general [14], [19], which makes
the comparison intractable, except for some very limited cases when such a policy happens to be known.

D. Extensions to A Decentralized Multiplayer Setting and Comparison with Similar Work 

As mentioned in the introduction, there have been a number of recent studies extending single player
algorithms to multi-player settings where collisions are possible [21], [11]. Within this context we note
that RCA-M in its current form does not extend in a straightforward way to a decentralized multi-player
setting. It remains an interesting subject of future study. A recent work [9] considers the same
restless multiarmed bandit problem studied in the present paper. They achieve logarithmic regret by using
exploration and exploitation blocks that grow geometrically with time. The construction in [9] is very
different from ours, but is amenable to multi-player extension [21] due to the deterministic, though growing,
nature of the block length which can be synchronized among players.

It is interesting to note that the essence behind our approach RCA-M is to reduce a restless bandit
problem to a rested bandit problem; this is done by sampling in a way to construct a continuous sample
path, which then allows us to use the same set of large deviation bounds over this reconstructed, entire
sample path. By contrast, the method introduced in [9] applies large deviation bounds to individual
segments (blocks) of the observed sample path (which is not a continuous sample path representative of
the underlying Markov chain because the chain is restless); this makes it necessary to precisely control
the length and the number of these blocks, i.e., they must grow in length over time. Another difference
is that under our scheme, the exploration and exploitation are done simultaneously and implicitly through
the use of the index, whereas under the scheme in [9], the two are done separately and explicitly through
two different types of blocks.

VII. Conclusion 

In this paper we considered the rested and restless multiarmed bandit problem with Markovian rewards
and multiple plays. We showed that a simple extension to UCB1 produces logarithmic regret uniformly
over time. We then constructed an algorithm RCA-M that utilizes regenerative cycles of a Markov chain
to compute a sample mean based index policy. The sampling approach reduces a restless bandit problem
to the rested version, and we showed that under mild conditions on the state transition probabilities of
the Markov chains this algorithm achieves logarithmic regret uniformly over time for the restless bandit
problem, and that this regret bound is also optimal. We numerically examined the performance of this
algorithm in the case of an OSA problem with the Gilbert-Elliot channel model.

Appendix A
Proof of Lemma 4

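Let $X^{i,j}(t)$ be the state observed from the t-th play of arm i by player j and $T^{i,j}(n)$ be the total
number of times player j played arm i up to and including time n. Then we have,

$$\left| R(n) - \left( n \sum_{j=1}^{M} \mu^j - \sum_{i=1}^{K} \mu^i E[T^i(n)] \right) \right| = \left| E\left[ \sum_{j=1}^{M} \sum_{i=1}^{K} \sum_{x \in S^i} \sum_{t=1}^{T^{i,j}(n)} r^i_x I(X^{i,j}(t) = x) \right] - \sum_{j=1}^{M} \sum_{i=1}^{K} \sum_{x \in S^i} r^i_x \pi^i_x E[T^{i,j}(n)] \right|$$

$$= \left| \sum_{j=1}^{M} \sum_{i=1}^{K} \sum_{x \in S^i} r^i_x \left( E[N(x, T^{i,j}(n))] - \pi^i_x E[T^{i,j}(n)] \right) \right| \leq \sum_{j=1}^{M} \sum_{i=1}^{K} \sum_{x \in S^i} r^i_x C_{P^i} = C_{S,P,r}, \quad (9)$$

where

$$N(x, T^{i,j}(n)) = \sum_{t=1}^{T^{i,j}(n)} I(X^{i,j}(t) = x),$$

and (9) follows from Lemma 2 using the fact that $T^{i,j}(n)$ is a stopping time with respect to the $\sigma$-field
generated by the arms played up to time n.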



Appendix B 

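Lemma 7: Assume Condition 1 holds and all arms are rested. Let $g^i_{t,s} = \bar{r}^i(s) + c_{t,s}$, $c_{t,s} = \sqrt{L \ln t / s}$.
Under UCB-M with constant $L > 112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, for any suboptimal arm i and optimal arm j
we have

$$E\left[ \sum_{t=1}^{n} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} I(g^j_{t,w} \leq g^i_{t,w_i}) \right] \leq \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}, \quad (10)$$

where $l = \left\lceil \frac{4L \ln n}{(\mu^M - \mu^i)^2} \right\rceil$ and $\beta = \sum_{t=1}^{\infty} t^{-2}$.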



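Proof: First, we show that for any suboptimal arm i and optimal arm j, $g^j_{t,w} \leq g^i_{t,w_i}$
implies that at least one of the following holds:

$$\bar{r}^j(w) \leq \mu^j - c_{t,w} \quad (11)$$

$$\bar{r}^i(w_i) \geq \mu^i + c_{t,w_i} \quad (12)$$

$$\mu^j < \mu^i + 2c_{t,w_i}. \quad (13)$$

This is because if none of the above holds, then we must have

$$g^j_{t,w} = \bar{r}^j(w) + c_{t,w} > \mu^j \geq \mu^i + 2c_{t,w_i} > \bar{r}^i(w_i) + c_{t,w_i} = g^i_{t,w_i},$$

which contradicts $g^j_{t,w} \leq g^i_{t,w_i}$.

If we choose $w_i \geq 4L \ln n / (\mu^M - \mu^i)^2$, then for $t \leq n$

$$2c_{t,w_i} = 2\sqrt{\frac{L \ln t}{w_i}} \leq 2\sqrt{\frac{L \ln t \, (\mu^M - \mu^i)^2}{4L \ln n}} \leq \mu^M - \mu^i \leq \mu^j - \mu^i,$$

which means (13) is false, and therefore at least one of (11) and (12) is true with this choice of $w_i$. Let
$l = \left\lceil \frac{4L \ln n}{(\mu^M - \mu^i)^2} \right\rceil$. Then we have,

$$E\left[ \sum_{t=1}^{n} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} I(g^j_{t,w} \leq g^i_{t,w_i}) \right] \leq \sum_{t=1}^{n} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} \left( P(\bar{r}^j(w) \leq \mu^j - c_{t,w}) + P(\bar{r}^i(w_i) \geq \mu^i + c_{t,w_i}) \right)$$

$$\leq \sum_{t=1}^{\infty} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} \left( P(\bar{r}^j(w) \leq \mu^j - c_{t,w}) + P(\bar{r}^i(w_i) \geq \mu^i + c_{t,w_i}) \right).$$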

Consider an initial distribution $q^i$ for the i-th arm. We have:

$$N_{q^i} = \left\| \left( \frac{q^i_y}{\pi^i_y},\ y \in S^i \right) \right\|_2 \leq \sum_{y \in S^i} \left\| \frac{q^i_y}{\pi^i_y} \right\|_2 \leq \frac{1}{\pi_{\min}},$$

where the first inequality follows from the Minkowski inequality. Let $n^i_y(t)$ denote the number of times
state y of arm i is observed up to and including the t-th play of arm i.

P{f\wi) > Ai' + ct,^J 



Consider a sample path oj and the events 



y&S^ 



y&s- 



If w ^ then 



Thus w ^ ^, therefore P{A) < P{B). Then continuing from ([T4] l: 



P (rlnliwi) - Wir^TTl > 



y y - 
. 1/1/-, \ 



yes 

WiCt 



\S'\rm 



yes^ 



\S'\ - 
< - — -t 



28(|S' 



where ([TST i follows from Lemma [T] by letting 



7 



I{Xl = y) - < 



1*^ 1^1/^ J/ "y 

and recalling $\hat{\pi}^i_y = \max\{\pi^i_y, 1 - \pi^i_y\}$ (note $P^i$ is irreducible).
Similarly, we have



x^y 



w — wr~ 



yes^ 
yes^ 



x+y 



> 



x^y 



< 



where ([TtT i again follows from Lemma [T] The result then follows from combining ([16) and ([TST i 

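$$E\left[ \sum_{t=1}^{n} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} I(g^j_{t,w} \leq g^i_{t,w_i}) \right] \leq \frac{|S^i| + |S^j|}{\pi_{\min}} \sum_{t=1}^{\infty} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} t^{-\frac{L \epsilon_{\min}}{28 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2}} \leq \frac{|S^i| + |S^j|}{\pi_{\min}} \sum_{t=1}^{\infty} t^{-2}.$$

Appendix C
Proof of Lemma 5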

Let $l$ be any positive integer and consider a suboptimal arm i. Then,

$$T^i(n) = M + \sum_{t=K+1}^{n} I(i \in A(t)) \leq M - 1 + l + \sum_{t=K+1}^{n} I(i \in A(t),\ T^i(t-1) \geq l). \quad (20)$$

Consider 

M 



and 

M 



If z/; e then i ^ A{t). Therefore {i € A{t)} C and 

lii £ A{t), T\t-l)>l) < I{uj £ E, T\t-l)>l) 

M 

< E^Kt.w<4th*)' T\t-l)>l). 

Therefore continuing from (|20] l. 

j=l t=K+l 

M n ^ X 

< M - 1 + Z + V V / ( mill 5^'^ < max ^. ) 

j,=l t=A'+l \ - - - - / 

j=l t=K+l w=l Wi=l 
M n t-l t-l 

J = l t=l W = lWi=l 



(21)

Using Lemma 7 with $l = \left\lceil \frac{4L \ln n}{(\mu^M - \mu^i)^2} \right\rceil$, we have for any suboptimal arm

$$E[T^i(n)] \leq M + \frac{4L \ln n}{(\mu^M - \mu^i)^2} + \sum_{j=1}^{M} \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}. \quad (22)$$



Appendix D 

Lemma 8: Assume Condition 1 holds and all arms are restless. Let $g^i_{t,w} = \bar{r}^i(w) + c_{t,w}$,
$c_{t,w} = \sqrt{L \ln t / w}$. Under RCA-M with constant $L > 112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, for any suboptimal arm i and
optimal arm j we have

$$E\left[ \sum_{t=1}^{t_2(b)} \sum_{w=1}^{t-1} \sum_{w_i=l}^{t-1} I(g^j_{t,w} \leq g^i_{t,w_i}) \right] \leq \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}, \quad (23)$$

where $l = \left\lceil \frac{4L \ln n}{(\mu^M - \mu^i)^2} \right\rceil$ and $\beta = \sum_{t=1}^{\infty} t^{-2}$.

Proof: Note that all the quantities in computing the indices in (23) come from the intervals
$X^i_2(1), X^i_2(2), \ldots$, $\forall i \in \{1, \ldots, K\}$. Since these intervals begin with state $\gamma^i$ and end with a return
to $\gamma^i$ (but excluding the return visit to $\gamma^i$), by the strong Markov property the process at these stopping
times has the same distribution as the original process. Moreover, by connecting these intervals together
we form a continuous sample path which can be viewed as a sample path generated by a Markov chain
with a transition matrix identical to that of the original arm. Therefore we can proceed in exactly the same
way as the proof of Lemma 7. If we choose $s_i \geq 4L \ln(n) / (\mu^M - \mu^i)^2$, then for $t \leq t_2(b) = n' \leq n$,
and for any suboptimal arm i and optimal arm j,

$$2c_{t,s_i} = 2\sqrt{\frac{L \ln t}{s_i}} \leq 2\sqrt{\frac{L \ln(t) (\mu^M - \mu^i)^2}{4L \ln(n)}} \leq \mu^j - \mu^i.$$

The result follows from letting $l = \left\lceil \frac{4L \ln n}{(\mu^M - \mu^i)^2} \right\rceil$ and using Lemma 7.



Appendix E 
Proof of Lemma [6] 



Let $c_{t,w} = \sqrt{L \ln t / w}$, and let $l$ be any positive integer. Then,

$$B^i(b) = 1 + \sum_{m=K+1}^{b} I(\alpha(m) = i) \leq l + \sum_{m=K+1}^{b} I(\alpha(m) = i,\ B^i(m-1) \geq l). \quad (24)$$



Consider any sample path u) and the following sets 

M 



and 



M 



If uj £ then a(m) 7^ i. Therefore {ui : a{m){uj) = i} C E and 



/(a(m) = i, B\m -!)>/)< /(w € B\m -!)>/) 



M 



< 



t2(m-l),T2^(t2(m-l)) 



< 



5't2(m-l),T|(t2(m-l))' -^X^T- " 1) > • 



Therefore continuing from 

A/ fe 

B'ib) < l + E E ^(<(™-i),T2^(M™-i))^^M--i)-^I(M--i))'^^('"-^)^^ 

j=l m=K+l 
M 



<- I +? > / I mm ql , < max Qt n 

\l<w<t^(m—l) ^ t2H)<Wi<t'j{'m—l) ^ ' 

j=l m=K+l 



M b i2(m-l) t2(m-l) 

j=lm=K+l w=l w,=t2il) 
M t2{b) t-1 t-1 

j = l t=l W = lWi=l 



(25) 



(26) 



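where, as given earlier, $g^i_{t,w} = \bar{r}^i(w) + c_{t,w}$, and we have assumed that the index value of an arm remains
the same between two updates. The inequality in (26) follows from the facts that the second outer sum
in (26) is over time while the second outer sum in (25) is over blocks, each block lasts at least two time
slots and at most M blocks can be completed in each time step. From this point on we use Lemma 8 to
get

$$E[B^i(b(n)) \mid b(n) = b] \leq \frac{4L \ln t_2(b)}{(\mu^M - \mu^i)^2} + 1 + M \sum_{j=1}^{M} \frac{(|S^i| + |S^j|)\beta}{\pi_{\min}}$$

for all suboptimal arms. Therefore,

$$E[B^i(b(n))] \leq \frac{4L \ln n}{(\mu^M - \mu^i)^2} + 1 + M \sum_{j=1}^{M} C_{i,j}, \quad (27)$$

since $n \geq t_2(b(n))$ almost surely.

The total number of plays of arm i at the end of block $b(n)$ is equal to the total number of plays of
arm i during the regenerative cycles of visiting state $\gamma^i$ plus the total number of plays before entering
the regenerative cycles plus one more play resulting from the last play of the block which is state $\gamma^i$.
This gives:

$$E[T^i(n)] \leq \left( \frac{1}{\pi^i_{\min}} + M^i_{\max} + 1 \right) E[B^i(b(n))].$$

Thus,

$$\sum_{i>M} (\mu^1 - \mu^i) E[T^i(n)] \leq \sum_{i>M} (\mu^1 - \mu^i) D_i E[B^i(b(n))] \quad (28)$$

$$\leq 4L \sum_{i>M} \frac{(\mu^1 - \mu^i) D_i \ln n}{(\mu^M - \mu^i)^2} + \sum_{i>M} (\mu^1 - \mu^i) D_i \left( 1 + M \sum_{j=1}^{M} C_{i,j} \right). \quad (29)$$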



Appendix F 
Proof of Theorem [2] 

Assume that the states which determine the regenerative sample paths are given a priori by
$\gamma = [\gamma^1, \ldots, \gamma^K]$. This is to simplify the analysis by skipping the initialization stage of the algorithm and
we will show that this choice does not affect the regret bound. We denote the expectations with respect
to RCA-M given $\gamma$ as $E_\gamma$. First we rewrite the regret in the following form:



M 



R^{n) = ^n^E^[T{n)]- E^ 



T{n) 

t=l a(t)eA{t) 



M 



+ ^L^E.\n - T{n)] - E^ 



i=i 



E E ' 

t=T{n)+l a{t)eA{t) 



M 



K 



Y^f^'E^mn)] -Y.^x'E, [T\n)] ) - Z, 



n 



M 



i=l 



+ fi^E^ [n - T{n)] - E^ 



E E 



t=T{n)+l a{t)eA{t) 

where for notational convenience, we have used

$$Z_\gamma(n) = E_\gamma\left[ \sum_{t=1}^{T(n)} \sum_{\alpha(t) \in A(t)} r^{\alpha(t)}_{x_{\alpha(t)}} \right] - \sum_{i=1}^{K} \mu^i E_\gamma[T^i(n)].$$

We have



M K 

Y,^'E,[T{n)]-Y,^l'E,[r{ 



n] 



i=l 



< 



M K M K 

j=l 1=1 j=l i=l 

M 

j=l i>M 
i>M 



(32) 



Since we can bound (32), i.e., the difference in the brackets in (30), logarithmically using Lemma 6, it
remains to bound $Z_\gamma(n)$ and the difference in (31). We have



M 



i=l yeS^ 



i>M j/GS' 
M 



E E ^(^i = y) 

b=l Xi&XHb) 
B^{b(n)) 

E E ^(^i = y) 

b=l XleXiib) 



(33) 



i=l 



i>M V 7' 



+ f^j„ax + lU7[^*(K^))] > 



where the inequality comes from counting only the rewards obtained during the SB2's for all suboptimal
arms and the last part of the proof of Lemma 6. Applying Lemma 3 to (33) we get



'B'ibin)) 

E E ^(^i = y) 

6=1 xiexiib) 



^E,[B\b{n))] . 



Rearranging terms we get 



Z,{n) > R*{n) - Y l^i^^ + 1)^7 [B'ibin))] 



(34) 



i>M 



where 



R*in) 



M 

E E ^l^^ 

i=l y&S' 



'B^(b(n)) 

E E ^i^i = y) 

6=1 XjeX*(b) 



M 



E E [T\n)] . 

i=l y^S' 



Consider now $R^*(n)$. Since all suboptimal arms are played at most logarithmically, the total number
of time slots in which an optimal arm is not played is at most logarithmic. It follows that the number
of discontinuities between plays of any single optimal arm is at most logarithmic. For any optimal arm
$i \in \{1, \ldots, M\}$ we combine consecutive blocks in which arm i is played into a single combined block,
and denote by $\bar{X}^i(j)$ the j-th combined block of arm i. Let $\bar{b}^i$ denote the total number of combined
blocks for arm i up to block b. Each $\bar{X}^i$ thus consists of two sub-blocks: $\bar{X}^i_1$ that contains the states
visited from the beginning of $\bar{X}^i$ (empty if the first state is $\gamma^i$) to the state right before hitting $\gamma^i$, and
sub-block $\bar{X}^i_2$ that contains the rest of $\bar{X}^i$ (a random number of regenerative cycles).

Since a combined block $\bar{X}^i$ necessarily starts after a discontinuity in playing the i-th best arm,
$\bar{b}^i(n)$ is less than or equal to the total number of discontinuities of play of the i-th best arm up to time
n. At the same time, the total number of discontinuities of play of the i-th best arm up to time n is less
than or equal to the total number of blocks in which suboptimal arms are played up to time n. Thus

$$\bar{b}^i(n) \leq \sum_{k>M} B^k(b(n)).$$



We now rewrite $R^*(n)$ in the following form:

M 



R*{n) 



i=l y£S' 



M 



i=l y€S' 



E E ^(^i = y) 

b'in) 

E 1^2 



+EE^;^7 

i=l y(£S' 



M 



6=1 

6-(n) 

E E ^(^i = y) 

b=l Xl€Xl(b) 
'b'(n) 

E 



-EE-;-;^ 

1=1 y€S' 
M 



(35) 



(36) 



(37) 



(38) 



(39) 



(40) 



1=1 k>M 

where the last inequality is obtained by noting the difference between (36) and (37) is zero by Lemma
3, using positivity of rewards to lower bound (38) by 0, and (35) to upper bound (39). Combining this
with (27) and (34) we can obtain a logarithmic upper bound on $-Z_\gamma(n)$ by the following steps:



i>M 

i=l k>M \ ^ ' j=l 

+ E ^^(^ma. + 1) ( . M ^1)2 + 1 + ^ E ^k.P 
i>M \ ^ ^ j=l 



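We also have,

$$\sum_{j=1}^{M} \mu^j E_\gamma[n - T(n)] - E_\gamma\left[ \sum_{t=T(n)+1}^{n} \sum_{\alpha(t) \in A(t)} r^{\alpha(t)}_{x_{\alpha(t)}} \right] \leq \sum_{j=1}^{M} \mu^j E_\gamma[n - T(n)] \leq \sum_{j=1}^{M} \mu^j \left( \frac{1}{\pi_{\min}} + \max_{i \in \{1, \ldots, K\}} M^i_{\max} + 1 \right).$$

Finally, combining the above results as well as Lemma 6 we get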



M K 



n) 



Z^{n) 



1=1 



M 



+ ^li^E^[n-T{n)]- 

i>M 
M 



t=T{n)+l a{t)eA{t) 



Mi) 



i=l k>M 



4L Inn 
4L Inn 



i>M 
M 



M 



+ max n'^^^ + 1 



^ V^min ie{l,...,K} 



i>M 



M 



i>M 



i=i 



Therefore we have obtained the stated logarithmic bound for (30). Note that this bound does not depend
on $\gamma$, and therefore is also an upper bound for $R(n)$, completing the proof.



